Why the starting values of your neural network's weights decide if it learns at all — and how to get them right.
Before a neural network learns anything, every weight must be given an initial value. This isn't just a technicality — it's the difference between a network that converges in minutes and one that never learns at all.
Think of tuning a stringed instrument. If every string starts wildly out of tune, the musician (the optimizer) has to make huge corrections that may overshoot. If every string starts at the exact same note, you can't play a chord: you need variety. The ideal: each string starts close to its correct pitch, with just enough variation.
The golden rule: keep activation variance ≈ constant across every layer.
If variance shrinks layer by layer, gradients vanish (network stops learning). If variance grows, gradients explode (network becomes unstable). Good initialization keeps the signal stable as it flows forward and backward.
All weights identical → all neurons compute the same thing → network collapses to 1 neuron per layer.
Weights too small → activations shrink exponentially through layers → deep layers never update.
Weights too large → activations blow up → loss becomes NaN, training crashes.
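The first failure mode (identical weights) can be checked directly. The sketch below is a toy two-layer tanh network in NumPy; the sizes, the constant 0.5, and the random data are arbitrary assumptions for illustration only. It shows that every hidden unit computes the same activation and receives the same gradient, so gradient descent can never make them differ.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))            # batch of 32 inputs with 8 features
y = rng.normal(size=(32, 1))            # arbitrary regression targets

# Every weight in each layer starts at the same constant value.
W1 = np.full((8, 4), 0.5)               # input -> 4 hidden units
W2 = np.full((4, 1), 0.5)               # hidden -> output

h = np.tanh(x @ W1)                     # hidden activations
pred = h @ W2
grad_pred = 2.0 * (pred - y) / len(x)   # d(MSE)/d(pred)
grad_h = grad_pred @ W2.T               # gradient reaching the hidden layer
grad_W1 = x.T @ (grad_h * (1.0 - h**2)) # backprop through tanh

# Compare every hidden unit against the first one.
same_activations = np.allclose(h, h[:, [0]])
same_gradients = np.allclose(grad_W1, grad_W1[:, [0]])
print(same_activations, same_gradients)
```

Because the gradients match column for column, any number of update steps leaves the four hidden units identical, which is the collapse described above.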
Interactive demo (not reproduced here): the activation distribution as a signal passes through a 10-layer network under different init schemes, with one bar per layer showing the spread (variance) of activations.
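The same experiment can be approximated offline. The following is a minimal NumPy sketch, not the original demo: the width, batch size, seed, and scheme names are assumptions chosen for illustration. It pushes unit-variance noise through a 10-layer fully connected network and records the activation variance after each layer.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def small_std(n_in, n_out):
    return 0.01                                   # fixed small standard deviation

def xavier_std(n_in, n_out):
    return np.sqrt(2.0 / (n_in + n_out))

def he_std(n_in, n_out):
    return np.sqrt(2.0 / n_in)

def layer_variances(weight_std, activation, depth=10, width=256, batch=256, seed=0):
    """Variance of the activations after each layer of a random dense network."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(batch, width))           # unit-variance input
    variances = []
    for _ in range(depth):
        W = rng.normal(0.0, weight_std(width, width), size=(width, width))
        a = activation(a @ W)
        variances.append(float(a.var()))
    return variances

schemes = {
    "small normal + tanh": (small_std, np.tanh),
    "Xavier + tanh": (xavier_std, np.tanh),
    "Xavier + ReLU": (xavier_std, relu),
    "He + ReLU": (he_std, relu),
}
results = {name: layer_variances(std, act) for name, (std, act) in schemes.items()}
for name, v in results.items():
    print(f"{name:22s} layer 1: {v[0]:.2e}   layer 10: {v[-1]:.2e}")
```

Under these assumptions the small-normal run should collapse toward zero, Xavier with ReLU should lose roughly half its signal per layer, and He with ReLU should stay within the same order of magnitude from the first layer to the last.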
Weights drawn from N(0, 0.01), a zero-mean Gaussian with standard deviation 0.01. Worked for 2–3 layer nets. Deeper? Gradients vanished: activations shrank toward zero by layer 5.
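Xavier (Glorot) initialization. Insight: balance the variance for both forward and backward passes. For a layer with n_in inputs and n_out outputs, average the two fan sizes: Var(W) = 2/(n_in + n_out).
Designed for tanh-like activations (approximately linear near zero) and also commonly used with sigmoid. Breaks down for ReLU, which zeros out half the signal.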
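As a concrete illustration, here is a minimal NumPy sketch of Xavier sampling in its normal and uniform forms. The function names and layer sizes are hypothetical; the only facts used are the target variance 2/(n_in + n_out) and that U(-a, a) has variance a²/3.

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    """Gaussian weights with variance 2 / (n_in + n_out)."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

def xavier_uniform(n_in, n_out, rng):
    """Uniform weights with the same variance: U(-a, a) has variance a^2 / 3."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

rng = np.random.default_rng(0)
n_in, n_out = 400, 600
target_var = 2.0 / (n_in + n_out)
W_normal = xavier_normal(n_in, n_out, rng)
W_uniform = xavier_uniform(n_in, n_out, rng)
print(target_var, W_normal.var(), W_uniform.var())
```

Both variants hit the same variance; they differ only in the shape of the distribution.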
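He (Kaiming) initialization. ReLU zeros out ~50% of values, so the signal's mean square drops by half at each layer. Fix: double the variance: Var(W) = 2/n_in.
That factor of 2 is the entire difference. It made very deep ReLU networks trainable from scratch and is the initialization used for ResNets.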
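A minimal NumPy sketch of He-normal sampling follows. The helper name and layer shapes are assumptions for illustration. For a convolution, fan-in counts every input that feeds one output value, that is, input channels times kernel height times kernel width.

```python
import numpy as np

def he_normal(shape, fan_in, rng):
    """Gaussian weights with variance 2 / fan_in."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=shape)

rng = np.random.default_rng(0)

# Dense layer: fan_in is the number of input features.
W_dense = he_normal((512, 256), fan_in=512, rng=rng)

# Conv layer with weights shaped (out_channels, in_channels, kh, kw):
# each output value sums over in_channels * kh * kw inputs.
out_ch, in_ch, kh, kw = 64, 32, 3, 3
conv_fan_in = in_ch * kh * kw
W_conv = he_normal((out_ch, in_ch, kh, kw), fan_in=conv_fan_in, rng=rng)

# Both products should be close to 2.
print(W_dense.var() * 512, W_conv.var() * conv_fan_in)
```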
Linear layer: y_j = Σ W_ji × x_i (sum over n_in inputs)
Assume inputs are i.i.d. with zero mean and Var(x)=1, and weights are i.i.d. with zero mean and independent of x.
Then Var(y) = n_in × Var(W) × Var(x) = n_in × Var(W) (variance of a sum of n_in independent zero-mean products).
Want Var(y) = 1 (same as input) → set Var(W) = 1/n_in.
Backward pass gives the same constraint but with n_out → Var(W) = 1/n_out.
Compromise: average both → Var(W) = 2/(n_in + n_out). Done!
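The derivation can be sanity-checked numerically. The sketch below assumes a single linear layer with unit-variance inputs and unit-variance upstream gradients; the sizes are arbitrary. It measures the forward variance of y and the backward variance of the gradient with respect to x for the three choices of Var(W).

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 300, 500, 2000
x = rng.normal(size=(batch, n_in))     # forward input, Var = 1
g = rng.normal(size=(batch, n_out))    # upstream gradient dL/dy, Var = 1

choices = {
    "1/n_in": 1.0 / n_in,
    "1/n_out": 1.0 / n_out,
    "2/(n_in+n_out)": 2.0 / (n_in + n_out),
}
forward_var, backward_var = {}, {}
for name, var_w in choices.items():
    W = rng.normal(0.0, np.sqrt(var_w), size=(n_in, n_out))
    forward_var[name] = float((x @ W).var())      # expect n_in * Var(W)
    backward_var[name] = float((g @ W.T).var())   # expect n_out * Var(W)
    print(f"{name:16s} forward {forward_var[name]:.3f}  backward {backward_var[name]:.3f}")
```

With n_in = 300 and n_out = 500, the fan-in choice keeps the forward variance at 1 but inflates the backward variance to about 500/300, the fan-out choice does the reverse, and the averaged choice lands between them (about 0.75 forward and 1.25 backward).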
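Same setup, except that the input x to each layer is now the previous layer's ReLU output, which is not zero-mean: Var(z) = n_in × Var(W) × E[x²] (pre-activation).
After ReLU: for z symmetric around zero, half the values become zero → E[ReLU(z)²] = ½ × Var(z). (The centered variance is smaller, about 0.34 × Var(z), but the mean square is what feeds the next layer.)
So Var(z_next) = ½ × n_in × Var(W) × Var(z). Want this = Var(z).
Solve → Var(W) = 2/n_in. The factor of 2 compensates for ReLU's 50% kill rate.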
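Uses only n_in (fan-in), which preserves the forward signal. The fan-out version, Var(W) = 2/n_out, preserves the backward signal instead; He et al. report that either choice alone is enough for training to converge.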
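The halving step is easy to check numerically. In the sketch below (standard-normal pre-activations are an assumption), the quantity that comes out at exactly half of Var(z) is the mean square E[ReLU(z)²], which is what the next layer's pre-activation variance depends on. The centered variance of the ReLU output is lower, because the output mean is no longer zero.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, size=1_000_000)   # pre-activations with Var(z) = 1
a = np.maximum(z, 0.0)                     # ReLU output

fraction_zeroed = float(np.mean(a == 0.0))   # about 0.5
mean_square = float(np.mean(a**2))           # about 0.5 * Var(z)
centered_var = float(a.var())                # about (1/2 - 1/(2*pi)) * Var(z)
print(fraction_zeroed, mean_square, centered_var)
```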
In practice, this is the lookup table that matters:
| Architecture | Initialization | Why |
|---|---|---|
| CNNs (ResNet) | He Normal | ReLU activations need the 2× factor |
| Transformers (GPT, BERT) | N(0, σ=0.02) or Xavier Normal | LayerNorm stabilizes; GELU ≈ linear near 0 |
| ViT | Trunc Normal σ=0.02 | Stabilizes patch embeddings |
| RNN / LSTM | Orthogonal (recurrent weights) + Xavier (input weights) | Orthogonal keeps the hidden state from exploding or vanishing through time |
| GAN Generator | N(0, 0.02) | Stabilizes fragile adversarial training |
| GAN Discriminator | He Normal | Leaky ReLU typical |
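To make the RNN row concrete, here is a sketch of orthogonal initialization via QR decomposition in NumPy. It uses a simplified linear recurrence h ← W h with no nonlinearity or inputs, which is an assumption made purely to isolate the effect of the recurrent matrix; the size, step count, and the 1.5 scale of the comparison matrix are arbitrary.

```python
import numpy as np

def orthogonal(n, rng):
    """Random n x n orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))        # fix column signs; q stays orthogonal

rng = np.random.default_rng(0)
n, steps = 128, 50
W_orth = orthogonal(n, rng)
W_gauss = rng.normal(0.0, 1.5 / np.sqrt(n), size=(n, n))   # slightly too large

h0 = rng.normal(size=n)
h_orth, h_gauss = h0.copy(), h0.copy()
for _ in range(steps):                    # simplified recurrence: h <- W h
    h_orth = W_orth @ h_orth
    h_gauss = W_gauss @ h_gauss

ratio_orth = np.linalg.norm(h_orth) / np.linalg.norm(h0)
ratio_gauss = np.linalg.norm(h_gauss) / np.linalg.norm(h0)
print(ratio_orth, ratio_gauss)
```

An orthogonal matrix preserves vector norms exactly, so the hidden-state norm stays put over all 50 steps, while the mis-scaled Gaussian matrix lets it grow by many orders of magnitude.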
Transformer components: embeddings → N(0, σ=0.02); Q/K/V and FFN → Xavier; LayerNorm → γ=1, β=0; output projection (deep models) → scale by 1/√num_layers.
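The recipe above could be sketched in NumPy as follows. All names, sizes, and the parameter layout are hypothetical, and "output projection" is read here as the attention output matrix W_o; other implementations may scale different matrices or use a different depth factor.

```python
import numpy as np

def init_transformer_params(vocab_size, d_model, d_ff, num_layers, rng):
    """Initial parameters for an embedding table plus one transformer block (sketch)."""
    def xavier(n_in, n_out):
        return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_in, n_out))

    return {
        "embedding": rng.normal(0.0, 0.02, size=(vocab_size, d_model)),
        "W_q": xavier(d_model, d_model),
        "W_k": xavier(d_model, d_model),
        "W_v": xavier(d_model, d_model),
        # Assumption: the depth-scaled "output projection" is W_o.
        "W_o": xavier(d_model, d_model) / np.sqrt(num_layers),
        "W_ff1": xavier(d_model, d_ff),
        "W_ff2": xavier(d_ff, d_model),
        "ln_gamma": np.ones(d_model),     # LayerNorm scale
        "ln_beta": np.zeros(d_model),     # LayerNorm shift
    }

rng = np.random.default_rng(0)
params = init_transformer_params(vocab_size=1000, d_model=256, d_ff=1024,
                                 num_layers=16, rng=rng)
for name, value in params.items():
    print(f"{name:10s} shape={value.shape}  std={value.std():.4f}")
```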
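Biases: hidden layers → zero. Output bias for imbalanced classes → log(p / (1-p)), where p is the positive-class frequency, so the initial prediction matches the base rate and convergence is faster.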
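A small worked example of the output-bias rule, using only the standard library. The 1% positive rate is a made-up figure, and the comparison assumes the final layer's weights contribute roughly zero at initialization, so the prediction is sigmoid(bias).

```python
import math

def prior_logit(p):
    """Bias b such that sigmoid(b) equals the positive-class frequency p."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def expected_bce(p, q):
    """Average binary cross-entropy when predicting q for every example."""
    return -(p * math.log(q) + (1.0 - p) * math.log(1.0 - q))

p = 0.01                                   # hypothetical positive-class frequency
bias = prior_logit(p)
initial_prediction = sigmoid(bias)         # recovers p

loss_zero_bias = expected_bce(p, 0.5)      # bias = 0 predicts 0.5 everywhere
loss_prior_bias = expected_bce(p, initial_prediction)
print(bias, initial_prediction, loss_zero_bias, loss_prior_bias)
```

Starting from the prior, the network does not have to spend its first steps learning that positives are rare, which is where the faster convergence comes from.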
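Normalization layers: γ=1, β=0 → the learned affine transform starts as identity, so the layer initially just normalizes. Normalization also reduces sensitivity to weight init in earlier layers.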
Fine-tuning: keep the pretrained weights. Initialize only the new head (Xavier/He). Use a tiny LR: 1e-5 to 1e-6.
For 100+ layer nets: init residual branch scale α=0 so block starts as identity: y = x + 0·F(x).
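A minimal NumPy sketch of the zero-initialized residual scale. The branch F is assumed here to be linear → ReLU → linear with He-initialized weights, and the width and depth are arbitrary. With α=0 the whole stack is an exact identity at initialization; with α=1 the same stack amplifies the signal at every block.

```python
import numpy as np

def residual_stack(x, num_blocks, alpha, rng):
    """Apply num_blocks residual blocks y = x + alpha * F(x) with He-initialized F."""
    width = x.shape[1]
    std = np.sqrt(2.0 / width)
    for _ in range(num_blocks):
        W1 = rng.normal(0.0, std, size=(width, width))
        W2 = rng.normal(0.0, std, size=(width, width))
        branch = np.maximum(x @ W1, 0.0) @ W2    # F(x): linear -> ReLU -> linear
        x = x + alpha * branch
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 64))
out_zero = residual_stack(x, num_blocks=50, alpha=0.0, rng=rng)
out_one = residual_stack(x, num_blocks=50, alpha=1.0, rng=rng)

identity_at_init = np.array_equal(out_zero, x)
growth = out_one.var() / x.var()
print(identity_at_init, growth)
```

In training, α would be a learnable parameter, so each block can move away from identity only as far as the loss rewards it.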
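PyTorch defaults (nn.Linear, nn.Conv2d) use Kaiming Uniform with a=√5, which works out to U(−1/√fan_in, 1/√fan_in): a reasonable general-purpose default, but smaller than the He scale for ReLU. You usually only override it for transformers, GANs, or LSTMs, or when you want exact He init (for example via nn.init.kaiming_normal_).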
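To see what that default amounts to, the sketch below reproduces the Kaiming-uniform arithmetic in plain Python rather than calling PyTorch. It assumes the documented formula gain = √(2 / (1 + a²)), std = gain / √fan_in, bound = √3 × std, and that the layer defaults pass a = √5; check the source of your installed version before relying on this.

```python
import math

def kaiming_uniform_bound(fan_in, a=0.0):
    """Bound b of U(-b, b) for Kaiming uniform with leaky-ReLU slope a."""
    gain = math.sqrt(2.0 / (1.0 + a * a))
    std = gain / math.sqrt(fan_in)
    return math.sqrt(3.0) * std            # U(-b, b) has std b / sqrt(3)

fan_in = 512
default_bound = kaiming_uniform_bound(fan_in, a=math.sqrt(5.0))  # assumed layer default
relu_bound = kaiming_uniform_bound(fan_in, a=0.0)                # He init for plain ReLU

default_var = default_bound**2 / 3.0       # equals 1 / (3 * fan_in)
he_var = relu_bound**2 / 3.0               # equals 2 / fan_in
print(default_bound, 1.0 / math.sqrt(fan_in), he_var / default_var)
```

Under these assumptions the default variance is one sixth of the He variance, so a deep plain-ReLU stack without normalization layers may still benefit from an explicit He init.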